Analysis notebook :
This notebook is a short version of the EDA notebook for cloud-price-data and contains the following results :
- Count of unique instances types
- Type of feature: numerical, categorical, bag of words
- Smple of feature values
- Distrubition of features (unique values / feature)
Importing the required python libraries and the sample dataset from the data/processed folder
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
df = pd.read_csv("../../data/processed/processed.csv")
d = df.copy()
sns.set(style="whitegrid")/opt/app-root/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (30,61,63,64) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
Dataframe for aws-pricing dataset.
df.head()| servicecode | location | locationType | instanceType | currentGeneration | instanceFamily | vcpu | physicalProcessor | storage | networkPerformance | ... | instanceCapacityMetal | elasticGraphicsType | instanceCapacityMedium | productType | instanceCapacity32xlarge | maxIopsBurstPerformance | provisioned | memory_num | clockSpeed_num | gpuMemory_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AmazonEC2 | Asia Pacific (Tokyo) | AWS Region | d2.xlarge | Yes | Storage optimized | 4.0 | Intel Xeon E5-2676 v3 (Haswell) | 3 x 2000 HDD | Moderate | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 30.5 | 2.4 | NaN |
| 1 | AmazonEC2 | EU (Frankfurt) | AWS Region | m5d.24xlarge | Yes | General purpose | 96.0 | Intel Xeon Platinum 8175 (Skylake) | 4 x 900 NVMe SSD | 25 Gigabit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 384.0 | 3.1 | NaN |
| 2 | AmazonEC2 | US East (N. Virginia) | AWS Region | r5dn.24xlarge | Yes | Memory optimized | 96.0 | Intel Xeon Platinum 8259 (Cascade Lake) | 4 x 900 NVMe SSD | 100 Gigabit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 768.0 | 3.1 | NaN |
| 3 | AmazonEC2 | US West (Oregon) | AWS Region | c3.2xlarge | No | Compute optimized | 8.0 | Intel Xeon E5-2680 v2 (Ivy Bridge) | 2 x 80 SSD | High | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15.0 | 2.8 | NaN |
| 4 | AmazonEC2 | Asia Pacific (Singapore) | AWS Region | m5dn.2xlarge | Yes | General purpose | 8.0 | Intel Xeon Platinum 8259 (Cascade Lake) | 1 x 300 NVMe SSD | Up to 25 Gigabit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32.0 | 3.1 | NaN |
5 rows × 68 columns
Unique instanceTypes
This dataframe (unique_instanceType_frame) displays all the unique instanceType(s) in this dataset.
unique_instancetype = df["instanceType"].unique()Length of the list of all unique instanceType(s) gives us total number of unique instanceType(s) in this sample of the dataset.
len(unique_instancetype)351Convert the list to a dataframe for better understanding of contents
unique_instancetype_frame = pd.DataFrame(np.array(unique_instancetype))
unique_instancetype_frame.columns = ["unique_instancetype"]
display(unique_instancetype_frame)| unique_instancetype | |
|---|---|
| 0 | d2.xlarge |
| 1 | m5d.24xlarge |
| 2 | r5dn.24xlarge |
| 3 | c3.2xlarge |
| 4 | m5dn.2xlarge |
| ... | ... |
| 346 | h1 |
| 347 | u-6tb1 |
| 348 | m3 |
| 349 | m5n |
| 350 | x1e |
351 rows × 1 columns
Total number of unique instanceType : 351
df_instanceType : DataFrame with counts of unique elements in each position.
df_instancetype = df.groupby("instanceType").nunique()
df_instancetype| servicecode | location | locationType | instanceType | currentGeneration | instanceFamily | vcpu | physicalProcessor | storage | networkPerformance | ... | instanceCapacityMetal | elasticGraphicsType | instanceCapacityMedium | productType | instanceCapacity32xlarge | maxIopsBurstPerformance | provisioned | memory_num | clockSpeed_num | gpuMemory_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| instanceType | |||||||||||||||||||||
| a1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| a1.2xlarge | 1 | 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| a1.4xlarge | 1 | 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| a1.large | 1 | 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| a1.medium | 1 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| z1d.3xlarge | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| z1d.6xlarge | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| z1d.large | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| z1d.metal | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| z1d.xlarge | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
350 rows × 68 columns
Numerical Features
Below shown table shows if the values in the column are numeric(True) or not(False)
is_numeric_col = d.dtypes.apply(lambda x: np.issubdtype(x, np.number))
df_features = pd.DataFrame(is_numeric_col)
df_features = df_features.rename(columns={0: "isNumeric"})
with pd.option_context("display.max_rows", None):
display(df_features)| isNumeric | |
|---|---|
| servicecode | False |
| location | False |
| locationType | False |
| instanceType | False |
| currentGeneration | False |
| instanceFamily | False |
| vcpu | True |
| physicalProcessor | False |
| storage | False |
| networkPerformance | False |
| processorArchitecture | False |
| tenancy | False |
| operatingSystem | False |
| licenseModel | False |
| usagetype | False |
| operation | False |
| capacitystatus | False |
| dedicatedEbsThroughput | False |
| ecu | False |
| enhancedNetworkingSupported | False |
| instancesku | False |
| intelAvxAvailable | False |
| intelAvx2Available | False |
| intelTurboAvailable | False |
| normalizationSizeFactor | True |
| preInstalledSw | False |
| processorFeatures | False |
| servicename | False |
| id | False |
| gpu | True |
| ebsOptimized | False |
| transferType | False |
| fromLocation | False |
| fromLocationType | False |
| toLocation | False |
| toLocationType | False |
| group | False |
| groupDescription | False |
| resourceType | False |
| instance | False |
| instanceCapacity12xlarge | True |
| instanceCapacity2xlarge | True |
| instanceCapacityLarge | True |
| instanceCapacityXlarge | True |
| physicalCores | True |
| instanceCapacity18xlarge | True |
| instanceCapacity4xlarge | True |
| instanceCapacity9xlarge | True |
| instanceCapacity10xlarge | True |
| instanceCapacity16xlarge | True |
| instanceCapacity8xlarge | True |
| instanceCapacity24xlarge | True |
| storageMedia | False |
| volumeType | False |
| maxVolumeSize | False |
| maxIopsvolume | False |
| maxThroughputvolume | False |
| volumeApiName | False |
| instanceCapacityMetal | True |
| elasticGraphicsType | False |
| instanceCapacityMedium | True |
| productType | False |
| instanceCapacity32xlarge | True |
| maxIopsBurstPerformance | False |
| provisioned | False |
| memory_num | True |
| clockSpeed_num | True |
| gpuMemory_num | True |
all_features = df_features.T.columnsFrom the above dataframe, extracting only the numerical features into the list numerical_fetaures and further making a dataframe - num_feat for better undertanding of the contents.
numerical_features = is_numeric_col[is_numeric_col]
n = numerical_features.index
num_feat = pd.DataFrame(n)
num_feat.rename(columns={0: "Numerical Features"})| Numerical Features | |
|---|---|
| 0 | vcpu |
| 1 | normalizationSizeFactor |
| 2 | gpu |
| 3 | instanceCapacity12xlarge |
| 4 | instanceCapacity2xlarge |
| 5 | instanceCapacityLarge |
| 6 | instanceCapacityXlarge |
| 7 | physicalCores |
| 8 | instanceCapacity18xlarge |
| 9 | instanceCapacity4xlarge |
| 10 | instanceCapacity9xlarge |
| 11 | instanceCapacity10xlarge |
| 12 | instanceCapacity16xlarge |
| 13 | instanceCapacity8xlarge |
| 14 | instanceCapacity24xlarge |
| 15 | instanceCapacityMetal |
| 16 | instanceCapacityMedium |
| 17 | instanceCapacity32xlarge |
| 18 | memory_num |
| 19 | clockSpeed_num |
| 20 | gpuMemory_num |
From the list of all features, all the features that are not numerical will be categorical features. Adding them to the list categorial and further making a dataframe - cat_feat for better understandin of the contents.
categorical = [x for x in all_features if x not in numerical_features]
cat_feat = pd.DataFrame(categorical)
with pd.option_context("display.max_rows", None):
display(cat_feat.rename(columns={0: "Categorical Features"}))| Categorical Features | |
|---|---|
| 0 | servicecode |
| 1 | location |
| 2 | locationType |
| 3 | instanceType |
| 4 | currentGeneration |
| 5 | instanceFamily |
| 6 | physicalProcessor |
| 7 | storage |
| 8 | networkPerformance |
| 9 | processorArchitecture |
| 10 | tenancy |
| 11 | operatingSystem |
| 12 | licenseModel |
| 13 | usagetype |
| 14 | operation |
| 15 | capacitystatus |
| 16 | dedicatedEbsThroughput |
| 17 | ecu |
| 18 | enhancedNetworkingSupported |
| 19 | instancesku |
| 20 | intelAvxAvailable |
| 21 | intelAvx2Available |
| 22 | intelTurboAvailable |
| 23 | preInstalledSw |
| 24 | processorFeatures |
| 25 | servicename |
| 26 | id |
| 27 | ebsOptimized |
| 28 | transferType |
| 29 | fromLocation |
| 30 | fromLocationType |
| 31 | toLocation |
| 32 | toLocationType |
| 33 | group |
| 34 | groupDescription |
| 35 | resourceType |
| 36 | instance |
| 37 | storageMedia |
| 38 | volumeType |
| 39 | maxVolumeSize |
| 40 | maxIopsvolume |
| 41 | maxThroughputvolume |
| 42 | volumeApiName |
| 43 | elasticGraphicsType |
| 44 | productType |
| 45 | maxIopsBurstPerformance |
| 46 | provisioned |
Displaying the contents of the dataframe that have categorical data
with pd.option_context("display.max_columns", None):
display(df[categorical].head())| servicecode | location | locationType | instanceType | currentGeneration | instanceFamily | physicalProcessor | storage | networkPerformance | processorArchitecture | tenancy | operatingSystem | licenseModel | usagetype | operation | capacitystatus | dedicatedEbsThroughput | ecu | enhancedNetworkingSupported | instancesku | intelAvxAvailable | intelAvx2Available | intelTurboAvailable | preInstalledSw | processorFeatures | servicename | id | ebsOptimized | transferType | fromLocation | fromLocationType | toLocation | toLocationType | group | groupDescription | resourceType | instance | storageMedia | volumeType | maxVolumeSize | maxIopsvolume | maxThroughputvolume | volumeApiName | elasticGraphicsType | productType | maxIopsBurstPerformance | provisioned | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AmazonEC2 | Asia Pacific (Tokyo) | AWS Region | d2.xlarge | Yes | Storage optimized | Intel Xeon E5-2676 v3 (Haswell) | 3 x 2000 HDD | Moderate | 64-bit | Shared | Windows | No License required | APN1-UnusedBox:d2.xlarge | RunInstances:0002 | UnusedCapacityReservation | 750 Mbps | 14 | Yes | NZHXGSV3KSMEQT45 | Yes | Yes | Yes | NaN | Intel AVX; Intel AVX2; Intel Turbo | Amazon Elastic Compute Cloud | Y9FZ6K8HF7D54PQK | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | AmazonEC2 | EU (Frankfurt) | AWS Region | m5d.24xlarge | Yes | General purpose | Intel Xeon Platinum 8175 (Skylake) | 4 x 900 NVMe SSD | 25 Gigabit | 64-bit | Dedicated | Linux | No License required | EUC1-DedicatedRes:m5d.24xlarge | RunInstances:0200 | AllocatedCapacityReservation | 12000 Mbps | 337 | Yes | KVTTH3ZWAM6J3RS5 | Yes | Yes | Yes | SQL Web | Intel AVX; Intel AVX2; Intel AVX512; Intel Turbo | Amazon Elastic Compute Cloud | F4MU8VH495AU9238 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | AmazonEC2 | US East (N. Virginia) | AWS Region | r5dn.24xlarge | Yes | Memory optimized | Intel Xeon Platinum 8259 (Cascade Lake) | 4 x 900 NVMe SSD | 100 Gigabit | 64-bit | Shared | Windows | No License required | BoxUsage:r5dn.24xlarge | RunInstances:0002 | Used | 12000 Mbps | NaN | No | NaN | No | No | No | NaN | NaN | Amazon Elastic Compute Cloud | KB8CP4XD7SBD3H3F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | AmazonEC2 | US West (Oregon) | AWS Region | c3.2xlarge | No | Compute optimized | Intel Xeon E5-2680 v2 (Ivy Bridge) | 2 x 80 SSD | High | 64-bit | Shared | Windows | No License required | USW2-UnusedBox:c3.2xlarge | RunInstances:0102 | UnusedCapacityReservation | NaN | 28 | Yes | GQDUSYAUZVGTEET2 | Yes | No | Yes | SQL Ent | Intel AVX; Intel Turbo | Amazon Elastic Compute Cloud | PTXRHKVQTWX2U9QH | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | AmazonEC2 | Asia Pacific (Singapore) | AWS Region | m5dn.2xlarge | Yes | General purpose | Intel Xeon Platinum 8259 (Cascade Lake) | 1 x 300 NVMe SSD | Up to 25 Gigabit | 64-bit | Host | Windows | No License required | APS1-HostBoxUsage:m5dn.2xlarge | RunInstances:0002 | Used | Up to 3500 Mbps | NaN | No | NaN | No | No | No | NaN | NaN | Amazon Elastic Compute Cloud | GZ8NPDY8GZ37SBXN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Displaying the contents of the dataframe that have numerical data
num_feat = [x for x in all_features if x not in categorical]
with pd.option_context("display.max_columns", None):
display(df[num_feat].head().fillna(0))| vcpu | normalizationSizeFactor | gpu | instanceCapacity12xlarge | instanceCapacity2xlarge | instanceCapacityLarge | instanceCapacityXlarge | physicalCores | instanceCapacity18xlarge | instanceCapacity4xlarge | instanceCapacity9xlarge | instanceCapacity10xlarge | instanceCapacity16xlarge | instanceCapacity8xlarge | instanceCapacity24xlarge | instanceCapacityMetal | instanceCapacityMedium | instanceCapacity32xlarge | memory_num | clockSpeed_num | gpuMemory_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.5 | 2.4 | 0.0 |
| 1 | 96.0 | 192.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 384.0 | 3.1 | 0.0 |
| 2 | 96.0 | 192.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 768.0 | 3.1 | 0.0 |
| 3 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.0 | 2.8 | 0.0 |
| 4 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32.0 | 3.1 | 0.0 |
Graphs for unique value ditribution
Compare the distribution of the existing features.
The function plot_bar_subplots allows the user to make bar graphs that will represent the percentage of a particular value for a feature.
plt.rcParams.update({"figure.max_open_warning": 0})def plot_bar_subplots(sub_df, width, height, bar_height):
features = sub_df.columns
height = height * len(features)
fig, ax = plt.subplots(len(features), 1, figsize=(width, height))
for i in range(len(features)):
feature_freqs = sub_df[features[i]].value_counts(normalize=True) * 100
feature_freq_df = pd.DataFrame(
feature_freqs.sort_index(inplace=False, ascending=True)
).reset_index()
g = sns.barplot(
x=features[i],
y="index",
data=feature_freq_df,
ax=ax[i],
orient="h",
dodge=False,
)
ax[i].set_ylabel(features[i], fontsize=20, va="center")
ax[i].tick_params(axis="y", which="minor")
ax[i].set_xlim(0, 100)
ax[i].set_xlabel("Normalized Value Counts (%)", fontsize=20)
ax[i].set_title(features[i], fontsize=30)
def change_height(ax, new_value):
for patch in ax.patches:
patch.set_height(new_value)
change_height(g, bar_height)
fig.tight_layout()
def plot_bar(dataframe, n):
features = dataframe.columns
is_numeric_col = d.dtypes.apply(lambda x: np.issubdtype(x, np.number))
df_features = pd.DataFrame(is_numeric_col)
numerical_features = [x for x in features if df_features.T[x][0] is True]
categorical = [x for x in features if x not in numerical_features]
small_features = [
feature
for feature in categorical
if len(dataframe[feature].unique()) < n
]
large_features = [x for x in categorical if x not in small_features]
dataframe[numerical_features] = (
dataframe[numerical_features]
.replace(["NA", "Variable"], "0")
.astype("float")
)
## Plots for features with numerical values
plot_bar_subplots(dataframe[numerical_features], 30, 30, 0.6)
## Plots for features with many values
plot_bar_subplots(dataframe[large_features], 30, 120, 0.6)
## Plots for features with less values
plot_bar_subplots(dataframe[small_features], 40, 7, 0.6)Implementing the function plot_bar on the given dataset.
- Removing some "all unique" kind of features to avoid disturbance in the graphs.
df_graph = df.copy() # removing all unique values for simplification purposes
df_graph = df_graph.reset_index()
df_graph.drop(
columns=["id", "instancesku", "usagetype", "index"], inplace=True
)
df_graph.dropna()
dd = pd.DataFrame(df_graph)
plot_bar(dd, 30)<Figure size 2160x0 with 0 Axes>- The above shown bar graphs represent frequency distribution for each feature.
- The x-axis represents the number count or percentage of occurrences in the data for each column and can be used to visualize data distributions.
- The y-axis represents the various values a features may have.
